TD - Chapter details v2

Chapter details

One of the basic concepts of ATLAS is the multilinguality of the content stored within the ATLAS applications. This has several implications:

Most documents imported into the application database are available in only a subset of the target languages; commonly they are single-language documents. They therefore have to be translated into the other languages.

Translating queries "on-the-fly" is computationally expensive, so the translation work has to be done beforehand. Here the situation is favorable: all documents are translated during or before the indexing stage, so that, at the level of a robust search engine, the retrieval mechanism can be language-agnostic, i.e. it searches for text occurrences without any need for query translation.

Cross-linguality is therefore achieved in a programmatically transparent fashion, i.e. by ignoring language at a broad level and following tagged data at finer levels. Since the index documents are processed by the ATLAS NLP pipelines beforehand, the text language is already known and tagging is a trivial task.
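
The idea of translating once at indexing time and tagging each version with its language can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `MultilingualIndex` class, the toy word-by-word `translate` function standing in for a translation provider, and the item identifier are hypothetical and do not describe the actual ATLAS implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for a translation provider: a tiny lookup table.
TOY_TRANSLATIONS = {
    ("de", "en"): {"kaninchen": "rabbit", "das": "the", "schläft": "sleeps"},
}


def translate(text, source, target):
    """Word-by-word toy translation; unknown words pass through unchanged."""
    table = TOY_TRANSLATIONS.get((source, target), {})
    return " ".join(table.get(word, word) for word in text.lower().split())


class MultilingualIndex:
    """In-memory inverted index whose postings carry a language tag."""

    def __init__(self, target_languages):
        self.target_languages = target_languages
        # token -> set of (item_id, language) postings
        self.postings = defaultdict(set)

    def add(self, item_id, text, language):
        """Translate once at indexing time and tag every posting."""
        versions = {language: text.lower()}
        for target in self.target_languages:
            if target != language:
                versions[target] = translate(text, language, target)
        for lang, version in versions.items():
            for token in version.split():
                self.postings[token].add((item_id, lang))

    def search(self, term, language=None):
        """Language-agnostic lookup; pass `language` to filter on the tag."""
        hits = self.postings.get(term.lower(), set())
        return sorted({item for item, lang in hits if language in (None, lang)})


index = MultilingualIndex(target_languages=["en", "de"])
index.add("item-1", "Das Kaninchen schläft", language="de")
```

Calling `index.search("rabbit")` ignores language entirely (the broad level), while `index.search("rabbit", language="en")` follows the language tag (the finer level). In both cases the query itself is never translated.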

The integration of CLIR in Atlas consists of two main parts: storing and querying.

Storing

When an Atlas item is saved, it is automatically sent for further content-based processing such as text mining, translation, summarization and categorization. In the translation step, the item's text properties and its text-mining excerpts are translated using one of the translation providers. The translated text is stored in a multilingual RDF file, which is then sent to the CLIR engine to be stored as a multilingual content item.
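
As an illustration of what a multilingual RDF file could look like, the sketch below serializes one text property as language-tagged literals in N-Triples syntax. The choice of N-Triples, the Dublin Core `description` predicate, the `example.org` item URI, and the `toy_translate` provider stub are all assumptions; the text above does not specify the serialization or vocabulary actually used.

```python
def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(item_uri, predicate_uri, texts_by_language):
    """Emit one language-tagged literal per language for a single property."""
    lines = []
    for language, text in sorted(texts_by_language.items()):
        literal = f'"{escape_literal(text)}"@{language}'
        lines.append(f"<{item_uri}> <{predicate_uri}> {literal} .")
    return "\n".join(lines)


def build_multilingual_rdf(item_uri, text, language, targets, translate):
    """Translate the text into every target language and serialize all versions."""
    texts = {language: text}
    for target in targets:
        if target != language:
            texts[target] = translate(text, language, target)
    # Assumed predicate: Dublin Core "description".
    return to_ntriples(item_uri, "http://purl.org/dc/terms/description", texts)


def toy_translate(text, source, target):
    """Hypothetical provider stub; a real provider would call a translation service."""
    samples = {("de", "en", "Das Kaninchen schläft."): "The rabbit sleeps."}
    return samples.get((source, target, text), text)


rdf = build_multilingual_rdf(
    "http://example.org/atlas/item/42",  # hypothetical item URI
    "Das Kaninchen schläft.",
    "de",
    ["de", "en"],
    toy_translate,
)
```

The resulting string holds one triple per language, each sharing the same subject, so a consumer of the file can recover both the item identifier and the language of every text version.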

Querying

One of the priorities of the target CMS is multilinguality. This includes cross-lingual functionality such as placing a query in one language and retrieving documents in a specific (different) language, in particular retrieving automatically translated response documents.

When a user searches for a term in Atlas, the search request is also processed by the CLIR engine. If the user searches for ‘rabbit’ and the original content item is in German and contains ‘Kaninchen’, the regular search will not return the item. However, the CLIR engine will find the RDF document that holds the content item and its translations. Based on the content item identifier, Atlas then returns the content item whose translation contains the search term.
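
The ‘rabbit’ example can be traced with a small sketch of the two search paths. The data structures and function names (`ATLAS_ITEMS`, `CLIR_DOCUMENTS`, `regular_search`, `clir_search`, `search`) are hypothetical, and simple substring matching stands in for whatever matching the real engines perform.

```python
# Hypothetical Atlas store: original content items keyed by identifier.
ATLAS_ITEMS = {
    "item-42": {"language": "de", "text": "Das Kaninchen schläft."},
}

# Hypothetical CLIR store: per item identifier, the original text plus translations.
CLIR_DOCUMENTS = {
    "item-42": {"de": "Das Kaninchen schläft.", "en": "The rabbit sleeps."},
}


def regular_search(term):
    """Match the term against the original text only."""
    needle = term.lower()
    return [item_id for item_id, item in ATLAS_ITEMS.items()
            if needle in item["text"].lower()]


def clir_search(term):
    """Return (item_id, language) for every stored version containing the term."""
    needle = term.lower()
    return [(item_id, lang)
            for item_id, versions in CLIR_DOCUMENTS.items()
            for lang, text in versions.items()
            if needle in text.lower()]


def search(term):
    """Combine both paths and resolve identifiers back to Atlas content items."""
    found = list(regular_search(term))
    for item_id, _lang in clir_search(term):
        if item_id not in found:
            found.append(item_id)
    return [ATLAS_ITEMS[item_id] for item_id in found]
```

Here `regular_search("rabbit")` finds nothing because the original text is German, while `clir_search("rabbit")` matches the English version and yields the identifier that `search` uses to return the German content item.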

Technical

Integration of CLIR in Atlas